Add Anima modular pipeline #13732

Open
rmatif wants to merge 6 commits into huggingface:main from rmatif:anima

Conversation

@rmatif rmatif commented May 12, 2026

What does this PR do?

Adds modular-only support for Anima, a text-to-image model built on top of the Cosmos Predict2 DiT architecture.

This PR adds:

  • AnimaModularPipeline and AnimaAutoBlocks
  • AnimaTextConditioner
  • checkpoint conversion for the original Anima weights
  • LoRA loading support for the Cosmos transformer and Anima text conditioner
  • docs and fast modular pipeline tests

Converted weights:
https://huggingface.co/mrfatso/anima-preview3-diffusers

Fixes #13067

cc @tdrussell

Testing

uv run pytest tests/modular_pipelines/anima/test_modular_pipeline_anima.py -q
uv run ruff check src/diffusers/modular_pipelines/anima src/diffusers/models/transformers/transformer_anima.py scripts/convert_anima_to_diffusers.py tests/modular_pipelines/anima/test_modular_pipeline_anima.py
uv run python utils/check_dummies.py
uv run python utils/modular_auto_docstring.py src/diffusers/modular_pipelines/anima

Tested the converted checkpoint locally with txt2img generation and LoRA loading

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you read our philosophy doc (important for complex PRs)?
  • Was this discussed/approved via a GitHub issue or the forum? [Feature] Add support for Anima #13067
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@yiyixuxu @asomoza

@github-actions github-actions Bot added size/L PR with diff > 200 LOC documentation Improvements or additions to documentation fixes-issue lora tests pipelines loaders and removed size/L PR with diff > 200 LOC labels May 12, 2026
@tdrussell

I'm the creator of Anima. I've run this and confirmed it looks good in terms of correct model implementation and producing images that match the quality of other implementations. I'll let the Diffusers team comment on any Diffusers-specific questions or concerns.

Member

asomoza commented May 12, 2026

@rmatif thanks a lot for this, can you please post an example code and a couple of generations with this?

Author

rmatif commented May 12, 2026

can you please post an example code and a couple of generations with this?

Sure, here is a small txt2img example using the converted checkpoint:

import torch
from diffusers import AnimaPipeline

pipe = AnimaPipeline.from_pretrained(
    "mrfatso/anima-preview3-diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = (
    "masterpiece, best quality, very aesthetic, absurdres, 1girl, solo, silver hair, blue eyes, "
    "long hair, school uniform, sailor collar, cherry blossoms, petals, spring, soft lighting, "
    "looking at viewer, upper body, detailed background"
)
negative_prompt = (
    "worst quality, low quality, score_1, score_2, score_3, blurry, jpeg artifacts, "
    "sepia, signature, artist name"
)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=1024,
    num_inference_steps=25,
    guidance_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(12341),
).images[0]

image.save("anima.png")
(Example generations attached: anima_example_1_seed_12341, anima_example_2_seed_12342.)

@sayakpaul
Member

@claude can you do a review?

@github-actions
Contributor

github-actions Bot commented May 13, 2026

Claude finished @sayakpaul's task in 4m 49s


Review of PR #13732: Add Anima pipeline

  • Read repository guidelines (AGENTS.md, models.md, pipelines.md, modular.md)
  • Review modeling_anima.py (AnimaTextConditioner)
  • Review pipeline_anima.py (AnimaPipeline)
  • Review __init__.py changes
  • Review LoRA integration (lora_pipeline.py, lora_conversion_utils.py)
  • Review tests
  • Review docs
  • Trace call path & check for dead code
  • Post final review

Overall this is a clean PR that reuses existing infra well (Cosmos transformer, QwenImage VAE, FlowMatchEuler scheduler). The code is readable and the LoRA conversion logic is well-structured. A few issues to fix before merge:


Blocking issues

1. _toctree.yml — broken YAML structure (docs won't build)

docs/source/en/_toctree.yml lines 502-506 use tab characters for indentation (the rest of the file uses spaces), and the Anima + AnimateDiff entries are structurally nested under the "Audio" section title (line 501). Anima is an image pipeline and should be a sibling of AuraFlow, Bria, etc., not a child of Audio.

The insertion also appears to have displaced the AnimateDiff entry from its original position.

Fix: insert the Anima entry at the same indentation level as the other image pipelines (e.g. between the Audio section and AuraFlow), using spaces, not tabs. AnimateDiff should remain in its original position.

2. Missing dummy objects

AnimaPipeline and AnimaTextConditioner need entries in src/diffusers/utils/dummy_torch_and_transformers_objects.py. Without them, from diffusers import AnimaPipeline will fail with a confusing ImportError when torch or transformers isn't installed, instead of the friendly "requires torch and transformers" message. Every other pipeline has these — see AnimateDiffPipeline at line 845 for the pattern.
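For readers unfamiliar with the mechanism, here is a minimal self-contained sketch of the dummy-object pattern (the real implementation lives in `diffusers.utils` and differs in detail; this toy `DummyObject` is a simplified stand-in):

```python
# Toy re-implementation of the dummy-object pattern: the class imports fine,
# but instantiating it raises a clear error naming the missing backends.
class DummyObject(type):
    def __call__(cls, *args, **kwargs):
        raise ImportError(
            f"{cls.__name__} requires the following backends: "
            + ", ".join(cls._backends)
        )


class AnimaPipeline(metaclass=DummyObject):
    _backends = ["torch", "transformers"]


class AnimaTextConditioner(metaclass=DummyObject):
    _backends = ["torch", "transformers"]
```

With entries like these in place, `from diffusers import AnimaPipeline` succeeds even when torch/transformers are absent, and the failure is deferred to instantiation with an actionable message.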

3. AnimaTextConditioner import ordering in src/diffusers/__init__.py

In the _import_structure dict (line 505-506), AnimaPipeline and AnimaTextConditioner are adjacent. But in the TYPE_CHECKING block, AnimaTextConditioner is at line 1314 (after all the AnimateDiff entries), separated from AnimaPipeline at line 1306. Should be right after AnimaPipeline to keep alphabetical consistency with the lazy import structure. Run make style to fix.


Non-blocking suggestions

4. F.scaled_dot_product_attention in AnimaTextConditionerAttention

modeling_anima.py:114 calls F.scaled_dot_product_attention directly. The repo convention for ModelMixin subclasses is to use dispatch_attention_fn with the AttentionModuleMixin / processor pattern (see .ai/models.md). AnimaTextConditioner inherits ModelMixin + PeftAdapterMixin and acts like a model. Using F.scaled_dot_product_attention directly means it won't benefit from attention backend dispatch (flash, sage, etc.).

This isn't blocking since the text conditioner is small relative to the transformer and runs once per generation (not in the denoising loop), but it diverges from repo convention. Consider adopting the dispatch_attention_fn pattern in a follow-up.
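The shape of the recommended pattern can be sketched without torch; note this registry is a toy illustration only, not the actual `dispatch_attention_fn` API:

```python
# Toy re-implementation of the backend-dispatch idea: model code routes
# attention through a registry, so backends (sdpa, flash, sage, ...) can be
# swapped via config without touching the model's forward().
_ATTENTION_BACKENDS = {}


def register_backend(name):
    def decorator(fn):
        _ATTENTION_BACKENDS[name] = fn
        return fn
    return decorator


def dispatch_attention(query, key, value, backend="sdpa"):
    try:
        attention_fn = _ATTENTION_BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown attention backend: {backend!r}") from None
    return attention_fn(query, key, value)


@register_backend("sdpa")
def _sdpa_attention(query, key, value):
    # stand-in for F.scaled_dot_product_attention
    return ("sdpa", query, key, value)


@register_backend("flash")
def _flash_attention(query, key, value):
    # stand-in for a flash-attention kernel
    return ("flash", query, key, value)
```

Calling `F.scaled_dot_product_attention` directly pins the module to one backend; routing through a dispatcher is what lets the processor pattern swap kernels at runtime.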

5. Unused target_token_weights parameter

AnimaTextConditioner.forward() declares target_token_weights (line 251) and uses it in the body (lines 287-288), but the pipeline never passes this argument. From the pipeline's call path, it's dead code. Per AGENTS.md: "do not carry unused method parameters 'for API consistency'". If this is intended for LoRA or external callers, document it; otherwise, remove it.

6. text_embed_dim in the test conditioner doesn't match target_dim

In test_anima.py:77, CosmosTransformer3DModel is created with text_embed_dim=16. The AnimaTextConditioner at line 97 has target_dim=16, which means prompt embeddings are 16-dim. However, the conditioner's out_proj maps model_dim → target_dim (16→16), and the output is what gets passed to the transformer as encoder_hidden_states. The pipeline feeds the output of text_conditioner (dim=target_dim) into the transformer which expects text_embed_dim. These match in the test (both 16), but the naming convention is worth documenting to avoid confusion — target_dim in the conditioner must equal text_embed_dim in the transformer config.
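A lightweight guard would make that implicit contract explicit. The helper below is hypothetical (not part of the PR); the config keys mirror the names discussed above:

```python
def check_text_dims(conditioner_config: dict, transformer_config: dict) -> None:
    """Hypothetical consistency check: the conditioner's output width must
    match the width the transformer expects for encoder_hidden_states."""
    target_dim = conditioner_config["target_dim"]
    text_embed_dim = transformer_config["text_embed_dim"]
    if target_dim != text_embed_dim:
        raise ValueError(
            f"AnimaTextConditioner target_dim ({target_dim}) must equal "
            f"CosmosTransformer3DModel text_embed_dim ({text_embed_dim})"
        )


# matches the fast-test config described above (both 16)
check_text_dims({"target_dim": 16}, {"text_embed_dim": 16})
```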

7. Consider # Copied from annotation for retrieve_timesteps

pipeline_anima.py:58 has # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps — good. Just making sure this passes make fix-copies.

8. Denoising loop differs significantly from Cosmos2

The Anima denoising loop (lines 498-544) doesn't use c_in/c_skip/c_out scaling, unlike the Cosmos2 pipeline which computes current_t = current_sigma / (current_sigma + 1) and applies velocity scaling. Since the model creator confirmed parity, this is presumably intentional (Anima was fine-tuned with a different training formulation). Just noting this for maintainer awareness — a brief comment in the pipeline noting the difference from standard Cosmos2 denoising would help future readers.
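To make the quoted difference concrete, only the σ→t mapping cited above is reproduced here; the exact `c_in`/`c_skip`/`c_out` formulas are deliberately omitted and should be read from the Cosmos2 pipeline source rather than guessed:

```python
def cosmos2_timestep(sigma: float) -> float:
    """Timestep reparameterization quoted in the review:
    current_t = current_sigma / (current_sigma + 1)."""
    return sigma / (sigma + 1.0)


# the mapping squashes sigma in [0, inf) into t in [0, 1)
assert cosmos2_timestep(0.0) == 0.0
assert cosmos2_timestep(1.0) == 0.5
```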


Suggestions / additional info: Dead code trace

I traced the call path from AnimaPipeline.__call__ through the CosmosTransformer3DModel.forward():

Parameters passed by the pipeline:

  • hidden_states
  • timestep
  • encoder_hidden_states
  • padding_mask
  • return_dict=False

Parameters NOT passed (using defaults):

  • block_controlnet_hidden_states=None — controlnet residuals, not applicable for base txt2img
  • attention_mask=None — text attention mask is not passed through. The text conditioner already zero-masks padded positions via target_attention_mask, so the output embeddings have zeros for padding, but the transformer's cross-attention still attends to those zero-padded positions. This works because the embeddings are zeroed out, not because attention is masked. Fine in practice.
  • fps=None — video-only parameter, correct for image generation
  • condition_mask=None — used for video conditioning, correct for image generation
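The "zeroed embeddings" point can be seen in a toy NumPy cross-attention (single head, arbitrary shapes and values): the padded positions do receive softmax mass, but because their value vectors are zero they add nothing to the output beyond renormalization.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


# toy cross-attention: 1 query, 4 keys/values, the last 2 are zeroed "padding"
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
kv_real = rng.normal(size=(2, 8))
k = np.vstack([kv_real, np.zeros((2, 8))])
v = np.vstack([kv_real, np.zeros((2, 8))])

w = softmax(q @ k.T / np.sqrt(8))
out = w @ v

# padded positions receive attention mass...
assert w[0, 2] > 0 and w[0, 3] > 0
# ...but contribute nothing to the output beyond renormalization:
assert np.allclose(out[0], w[0, :2] @ kv_real)
```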

Under default configs (concat_padding_mask=True, extra_pos_embed_type=None in the test config):

  • self.learnable_pos_embed is None when extra_pos_embed_type=None, so the extra_pos_emb path is dead in this config
  • use_crossattn_projection=False (default) → crossattn_proj is never instantiated or called
  • img_context_dim_in=None (default) → img_context_proj never instantiated

These are all expected to be dead under the Anima config — they're Cosmos-specific features that Anima doesn't use. No issues here.


@yiyixuxu
Collaborator

@rmatif

thanks for the PR! super excited to have Anima in diffusers

At a high level, I have two pieces of feedback/questions:

(1) Can we explore having an AnimaTransformer3DModel instead of a standalone AnimaTextConditioner? Is there any reason you went with this design? I think it'd be more consistent with our conventions to have a dedicated model that (1) has shared layers copied from Cosmos with # Copied from statements and (2) includes this additional conditioning layer as a submodule. Additionally, it'd be closer to the original checkpoint format, which will make future support easier.

(2) Can we support this pipeline from Modular Diffusers directly? Given the very active community ecosystem and the continuous training/release style, Modular is a better fit — see the docs here: https://huggingface.co/docs/diffusers/main/en/modular_diffusers/overview. Since you've already implemented it as a standard pipeline, it would take a little refactoring. Happy to provide more info if interested; we have pretty good docs for AI agents on this and I can point you to reference PRs as well.

@tdrussell

Regarding (1), what if we just subclassed the Cosmos DiT, like ComfyUI does?

The main reason to try to avoid duplicating code is that Anima's DiT architecture is identical to the Cosmos-Predict2 DiT. The only change is the LLM Adapter module (called AnimaTextConditioner in this PR). In ComfyUI the adapter lives as a submodule of the DiT for convenience, but it's not called in the forward() method since it only needs to run once for the entire diffusion process. So regardless of the structure, the pipeline code is going to be calling the adapter "manually" only once.

@yiyixuxu
Collaborator

@tdrussell

what if we just subclassed the Cosmos DiT?

This isn't something we do in diffusers — all our models are self-contained and inherit from ModelMixin directly. We try to keep the code structure flat and easy to read

In ComfyUI the adapter lives as a submodule of the DiT for convenience, but it's not called in the forward() method since it only needs to run once for the entire diffusion process.

ohh, we usually include text condition layers in forward as well for simplicity — the performance tradeoff is typically non-significant. But if that's not the case for Anima, keeping it as a separate component, as this PR does, makes sense.

@github-actions github-actions Bot added size/L PR with diff > 200 LOC utils labels May 13, 2026
@rmatif
Author

rmatif commented May 13, 2026

@yiyixuxu My preference is also to keep AnimaTextConditioner as a separate component in this PR

The main reason is that Anima’s DiT is not a new architecture. The denoiser weights and forward path are the Cosmos Predict2 DiT, the Anima-specific part is the LLM adapter that turns Qwen3 hidden states + T5 token ids into the encoder_hidden_states consumed by Cosmos

Since subclassing CosmosTransformer3DModel is not a Diffusers pattern, an AnimaTransformer3DModel would probably mean copying the Cosmos transformer into a new self-contained class just to add one adapter submodule. That feels worse to me for many reasons

The checkpoint conversion does split net.llm_adapter.* into text_conditioner/, but it is still strict and direct, there is no architectural remapping beyond separating the adapter from the unchanged Cosmos DiT

If the preference is still to make an AnimaTransformer3DModel, I can do that, but I think it would mostly be a wrapper/copy around Cosmos rather than a meaningfully different model class

And I agree that Modular Diffusers is a good fit for Anima. Would it be okay to handle Modular support in a follow-up PR?

@tdrussell

@yiyixuxu

ohh, we usually include text condition layers in forward as well for simplicity — the performance tradeoff is typically non-significant. But if that's not the case for Anima

It's not the case for Anima. The LLM Adapter is 6 transformer layers with both self- and cross-attention, which is heavier than what is typical in most models (often just a single MLP projection layer). Anima basically has a mini text encoder that is converting from Qwen3 embedding space to T5XXL embedding space for input to the model. It's been a while since I ran the numbers, and I didn't write it down, but I recall the LLM Adapter as being ~10% of the full forward pass. IMO this is enough to warrant being called just once for the entire diffusion loop (and is what ComfyUI does as well).
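Structurally, this argument boils down to hoisting the adapter call out of the denoising loop. The sketch below is illustrative only — all names and signatures are hypothetical stand-ins, not the PR's actual API:

```python
def generate(text_conditioner, transformer, scheduler, latents,
             qwen_hidden_states, t5_token_ids):
    # one-time cost (~10% of a forward pass, per the comment above): map
    # Qwen3 hidden states + T5 token ids to the embeddings the DiT consumes
    encoder_hidden_states = text_conditioner(qwen_hidden_states, t5_token_ids)

    # the denoising loop reuses the cached embeddings on every step
    for t in scheduler.timesteps:
        noise_pred = transformer(
            latents, timestep=t, encoder_hidden_states=encoder_hidden_states
        )
        latents = scheduler.step(noise_pred, t, latents)
    return latents
```

If the adapter lived inside the transformer's `forward()`, its cost would be paid on every denoising step instead of once per generation.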

@yiyixuxu
Collaborator

Sounds good to keep AnimaTextConditioner then!

Can we support Anima only through Modular Diffusers, rather than maintaining both? We've been supporting new pipelines through both, but now that Modular is officially released we're looking to shift new pipelines to modular-only. Especially since we expect Anima to be a very actively developed model, both from the author and the community, the maintenance cost from the standard pipeline could be quite high for us.

@rmatif rmatif changed the title Add Anima pipeline Add Anima modular pipeline May 13, 2026
@rmatif
Author

rmatif commented May 13, 2026

Can we support Anima only through Modular Diffusers, rather than maintaining both?

Fair enough, I moved everything into Modular. Looking forward to your review

Here’s the updated example:

import torch
from diffusers import AnimaAutoBlocks
from diffusers.guiders import ClassifierFreeGuidance

pipe = AnimaAutoBlocks().init_pipeline("mrfatso/anima-preview3-diffusers")
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.update_components(guider=ClassifierFreeGuidance(guidance_scale=4.0))
pipe.to("cuda")

prompt = (
    "masterpiece, best quality, very aesthetic, absurdres, 1girl, solo, silver hair, blue eyes, "
    "long hair, school uniform, sailor collar, cherry blossoms, petals, spring, soft lighting, "
    "looking at viewer, upper body, detailed background"
)
negative_prompt = (
    "worst quality, low quality, score_1, score_2, score_3, blurry, jpeg artifacts, "
    "sepia, signature, artist name"
)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=1024,
    num_inference_steps=25,
    generator=torch.Generator(device="cuda").manual_seed(12341),
).images[0]

image.save("anima.png")

Collaborator

@yiyixuxu yiyixuxu left a comment


i left one comment,
overall looks good to me, thanks for working on this


class AnimaTextConditionerBlock(nn.Module):
Collaborator


ohhh but it is not a transformer. I think we have a couple of options:

  1. Create a new folder under models/ for non-standard pipeline components.
  2. Follow the same convention as in standard pipelines, host it under modular_pipelines/anima/text_conditioner.py. it requires a small change in modular from_pretrained() to work since the model is pipeline local and won't be importable on top-level

want to hear everyone's thoughts!

I think maybe it's time for (1) because it is just strange that we host model components under pipeline folders. the pipeline-local model structure was designed at the time we use same UNet and vae for every pipeline. A lot has changed since — all our models now follow the single-file pattern and pretty much every model is pipeline-specific. maybe we don't have to keep that distinction anymore

Collaborator


cc @DN6 here

Comment thread docs/source/en/api/pipelines/anima.md Outdated
@yiyixuxu yiyixuxu requested a review from sayakpaul May 14, 2026 03:25